We consider algorithms that maximize a global function $G$ in a distributed manner, using a different adaptive computational agent to set each variable of the underlying space. Each agent $\eta$ is self-interested; it sets its variable to maximize its own function $g_\eta$. Three factors govern such a distributed algorithm’s performance, related to exploration/exploitation, game theory, and machine learning. We demonstrate how to exploit all three factors by modifying a search algorithm’s exploration stage: rather than random exploration, each coordinate of the search space is now controlled by a separate machine-learning-based “player” engaged in a noncooperative game. Experiments demonstrate that this modification improves simulated annealing (SA) by up to an order of magnitude for bin packing and for a model of an economic process run over an underlying network. These experiments also reveal interesting small-world phenomena.
I. INTRODUCTION

Many systems found in nature have inspired computational algorithms for how to maximize a provided high-dimensional function. In some of these algorithms the values of the underlying coordinates are controlled by separate players engaged in a noncooperative game; their equilibrium joint state (hopefully) maximizes the provided global function $G$. Examples of such natural systems are auctions and clearing of markets. Typically, in the computational algorithms inspired by such “collectives” of players, each “player” is a separate machine-learning algorithm [1,2], e.g., a reinforcement-learning (RL) algorithm [3,4].

There are three crucial issues concerning such collectives. The first is whether the payoff function $g_\eta$ of each player $\eta$ is sufficiently sensitive to the value of the coordinate that $\eta$ controls, in comparison to the values of the other coordinates. If this is not the case, it is not feasible for $\eta$ to learn how to set its coordinate to achieve high payoff. The second crucial issue is the need for all of the $g_\eta$ to be “aligned” with $G$, so that as the players individually learn how to increase their payoffs, $G$ also increases.

Other collective systems found in nature that have inspired function-maximization algorithms do not involve players conducting a noncooperative game. Examples include equilibrating spin glasses, genomes undergoing neo-Darwinian natural selection, and eusocial insect colonies. These have been translated into simulated annealing (SA) [5], genetic algorithms [6], and swarm intelligence [7], respectively. The third crucial issue is most prominent in such algorithms: the need to trade off exploration and exploitation.

Recent analysis [9] reveals that a collective’s $G$ value is governed by the interaction between these three effects: the alignment of $g_\eta$ with $G$, the “learnability” of $g_\eta$, and the exploration/exploitation tradeoff [10]. These three issues are traditionally studied in three separate fields: game theory, machine learning/statistics, and optimization theory, respectively. So any complete science of collectives must incorporate insights from all three fields; no single one of the fields suffices.

Previous work in the Collective INtelligence (COIN) framework has partially accomplished this, addressing the first two issues. That work is an extension of game-theoretic mechanism design, to include off-equilibrium behavior, learnability issues, nonhuman $g_\eta$ (e.g., $g_\eta$ for which incentive compatibility is irrelevant), and arbitrary $G$ [11,12]. In domains from network routing to congestion problems, COIN-based algorithms beat traditional techniques by up to several orders of magnitude [4,10].

Here we address all three issues at once, by replacing the exploration step in any exploration/exploitation search algorithm. In the exploration step each separate coordinate of the multicoordinate space being searched is made “intelligent,” its value being the move of a player/agent designed using the COIN framework (rather than the value of a random sample of a proposal distribution). We call the class of such algorithms Intelligent Coordinates for search (IC).

We concentrate on IC with SA as the exploration-based search algorithm. Like SA, IC is intended to be used “off the shelf”; rarely will it be the best possible algorithm for a particular domain. Also like SA, IC is best suited to very large problems (so parallelization can be exploited), where there is little exploitable gradient information.

We present experiments comparing IC and SA on two archetypal domains: bin-packing and an economic model of people choosing formats for their home music systems. In bin-packing IC achieves a given value of $G$ up to three orders of magnitude faster than does SA, an improvement ratio that increases linearly with problem size. In the format choice problem, each person $\eta$ chooses several formats to adopt, and $G$ is the sum of everyone’s “happiness” with their move. In turn, $\eta$’s happiness with each of the formats making up her move is set by three factors: which of her nearest neighbors on a ring network ($\eta$’s “friends”) choose that format; $\eta$’s intrinsic preference for that format; and the price of music purchased in that format, inversely proportional to the total number of players using that format. Here IC improves $G$ two orders of magnitude faster than SA. We also tested an algorithm designed to “endogenize externalities,” in the language of economics; IC outperformed it by over two orders of magnitude. We also replaced the ring with a small-world network [8,13]. This barely improved IC’s performance (3%), and had no effect on the other algorithms. However, if $G$ too was changed, so that $\eta$’s happiness depends on agreeing with her friends’ friends, the improvement is significant (10%).

II. SIMPLIFIED THEORY OF COLLECTIVES 

Let $z \in \zeta$ be the joint move of all agents/players in the collective, with agent $\eta$’s move being $z_\eta$, and $z_{-\eta}$ being the other agents’ moves. We wish to find the $z$ maximizing the provided world utility $G(z)$. We also have private utilities $\{g_\eta\}$, one for each agent $\eta$. These are the functions that the individual agents try to maximize.

It will be useful to “standardize” utility functions so that the value they assign to $z$ only reflects their ranking of $z$ relative to other possible points in $\zeta$. Such a standardization is called intelligence, one form of which is

$$N_{\eta,U}(z) = \int d\mu_{z_{-\eta}}(z')\, \Theta[U(z) - U(z')], \tag{1}$$

where $\Theta$ is the Heaviside function, and where the subscript on the (normalized) measure $d\mu$ indicates it is restricted to $z'$ such that $z'_{-\eta} = z_{-\eta}$. Intuitively, $N_{\eta,U}(z) \in [0.0, 1.0]$ is the percentile rank of $\eta$’s choice of move in comparison to her alternatives. The ranking is according to utility $U$, and is in the context of the other agents making move $z_{-\eta}$.

We define $N_G$ and $N_g$ as the vectors of the intelligences of all agents for the world utility and the agents’ separate private utilities, respectively. $N_{\eta,g_\eta}(z) = 1$ means that agent $\eta$’s move maximizes its utility, given the moves of the other agents. So in game theory terms, $N_g(z) = \vec{1}$ means $z$ is a Nash equilibrium. Conversely, $N_G(z') = \vec{1}$ means that the value of $G$ cannot increase in moving from $z'$ along any single coordinate of $\zeta$.
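As an illustration of the percentile-rank reading of intelligence and of the Nash condition, here is a minimal Python sketch. It assumes a finite move set shared by all agents, a uniform measure over an agent's alternative moves, and the convention $\Theta(0) = 1$ (so that a utility-maximizing move has intelligence 1). The function names and the two toy utilities are hypothetical and not taken from the paper.

```python
def intelligence(utility, z, agent, moves):
    """Fraction of the agent's alternative moves (others held fixed) whose
    utility does not exceed that of its actual move.
    Assumes a uniform measure over `moves` and Theta(0) = 1."""
    current = utility(z)
    not_better = 0
    for m in moves:
        alternative = z[:agent] + (m,) + z[agent + 1:]
        if utility(alternative) <= current:
            not_better += 1
    return not_better / len(moves)


def is_nash(private_utilities, z, moves):
    """True when every agent's private-utility intelligence equals 1."""
    return all(intelligence(g, z, agent, moves) == 1.0
               for agent, g in enumerate(private_utilities))


# Toy two-agent example (hypothetical utilities).
moves = (0, 1, 2, 3)
g0 = lambda z: -(z[0] - z[1]) ** 2   # agent 0 wants to match agent 1
g1 = lambda z: -(z[1] - 2) ** 2      # agent 1 wants to play move 2

print(intelligence(g0, (2, 0), 0, moves))   # agent 0's move ranks mid-table
print(is_nash([g0, g1], (2, 2), moves))     # both agents are best-responding
```

The same `intelligence` function applied to $G$ instead of a private utility gives the components of $N_G$.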

Indicate the agents’ private utilities by $s$. Our uncertainty about the system induces a distribution $P(z|s)$. Bayes’ theorem gives us the associated $P(G|s)$:

$$P(G|s) = \int dN_G\, P(G|N_G,s) \int dN_g\, P(N_G|N_g,s)\, P(N_g|s). \tag{2}$$

This is the central equation. Say that for some $s$ the third conditional probability in the integrand is peaked near $N_g = \vec{1}$, i.e., $s$ probably induces large (private utility) intelligences. If in addition the second probability term is peaked near $N_G = N_g$, then $N_G$ is also large. In particular, if $s$ guarantees that $N_g$ equals $N_G$ identically $\forall z$, then the second term is a $\delta$ function, regardless of $P(z|s)$. Such a system is called factored. Finally, say that in addition to such second and third terms, the first term is peaked about high $G$ whenever $N_G$ is large. Then as desired, $s$ (probably) induces high $G$.

The second term is related to game theory. As an example, a team game, where $g_\eta = G\ \forall \eta$, is factored [10]. However, team games usually have poor third terms, especially in large collectives. This is because agent $\eta$ chooses its move based on what it can learn about the effect of that move on $g_\eta$. This learning is based on previous observations of what value $g_\eta$ had when $\eta$ made its various moves. These observations are necessarily made in the presence of the (varying) moves of all the other agents. So say $g_\eta = G$, and the moves of the other agents affect $G$ comparably to how much $\eta$’s move does. Then when there are many of those other agents, it will be difficult to distinguish the “signal,” of the effect of $\eta$’s move on the value of $g_\eta$, from the “noise,” of the effect of other agents’ moves. This will make it difficult for $\eta$ to learn how to make a move that has high intelligence for $g_\eta$.

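Term three is related to machine learning. Say that we IID sample the values that the utility $g_\eta$ has whenever the move made by agent $\eta$ is $z^1_\eta$, and similarly for when that move is $z^2_\eta$. This gives us a training set of move-utility pairs $n_\eta$. The associated learnability, $\Lambda(g_\eta; n_\eta, z^1_\eta, z^2_\eta)$, is defined as

$$\Lambda(g_\eta; n_\eta, z^1_\eta, z^2_\eta) = \frac{\left| E(g_\eta(z^1_\eta, \cdot)\,|\,n_\eta) - E(g_\eta(z^2_\eta, \cdot)\,|\,n_\eta) \right|}{\sqrt{\mathrm{Var}[g_\eta(z^1_\eta, \cdot)\,|\,n_\eta] + \mathrm{Var}[g_\eta(z^2_\eta, \cdot)\,|\,n_\eta]}}, \tag{3}$$

where the expectations and variances have $\eta$’s move fixed as indicated, with $z_{-\eta}$ varying according to $P(z_{-\eta}|n_\eta)$.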

The denominator in Eq. (3) reflects the sensitivity of $g_\eta(z)$ to $z_{-\eta}$, while the numerator reflects its sensitivity to $z_\eta$. So the greater $\Lambda(g_\eta; n_\eta, z^1_\eta, z^2_\eta)$, the more $g_\eta(z)$ depends on which of those two moves agent $\eta$ adopts, in comparison to its dependence on the joint move of the other agents. In other words, for larger learnability it is easier for $\eta$ to distinguish in $n_\eta$ how its choice between those two moves affects $g_\eta$ (the signal), from how other agents affect $g_\eta$ (the noise). Formally, the expected intelligence of agent $\eta$’s move is an increasing function of the value of $\Lambda(g_\eta; n_\eta, z^1_\eta, z^2_\eta)$ for all candidate pairs of moves, $z^1_\eta, z^2_\eta$. So if $s$ specifies a $g_\eta$ with high learnability for $n_\eta$, term 3 will have the desired form.
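To make the signal-to-noise reading of Eq. (3) concrete, the following Python sketch estimates learnability from two lists of sampled utility values and applies it to a toy team game. It is only an illustration under stated assumptions: empirical means and population variances stand in for the conditional expectations and variances, and the toy team game ($G$ equal to the sum of binary moves, with the other agents' joint moves enumerated uniformly) is hypothetical rather than one of the paper's domains.

```python
import itertools
import math
import statistics


def learnability(samples_move1, samples_move2):
    """Empirical version of Eq. (3): gap between the mean utilities for two
    moves, divided by the root of the summed variances (the 'noise')."""
    gap = abs(statistics.fmean(samples_move1) - statistics.fmean(samples_move2))
    noise = math.sqrt(statistics.pvariance(samples_move1)
                      + statistics.pvariance(samples_move2))
    if noise == 0.0:
        return math.inf if gap > 0 else 0.0
    return gap / noise


def team_game_samples(move, n_others):
    """Utility g = G = sum of all binary moves, with the agent's own move
    fixed and the other agents ranging over every joint move."""
    return [move + sum(others)
            for others in itertools.product((0, 1), repeat=n_others)]


# Learnability of the team-game utility shrinks as the collective grows.
for n_others in (2, 8):
    value = learnability(team_game_samples(1, n_others),
                         team_game_samples(0, n_others))
    print(n_others, value)
```

In this toy setting the value falls as $\sqrt{2/n}$ with the number $n$ of other agents, which is the "many other agents drown out the signal" effect described above for team games.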

A difference utility has the form $g_\eta(z) = G(z) - D(z_{-\eta})$. Any such utility is factored [9]. The $D(z_{-\eta})$ that maximizes $\Lambda(g_\eta; n_\eta, z^1_\eta, z^2_\eta)$, for all pairs $z^1_\eta, z^2_\eta$, is $\sum_{z'_\eta} f(z'_\eta)\, G(z'_\eta, z_{-\eta})$ [8,9] (the form of the distribution $f$ is beyond the scope of this paper). The associated $g_\eta$ is called the aristocrat utility (AU). If $\eta$’s private utility is AU rather than $G$, then the last two terms of the central equation are more biased towards higher world utility values. Another advantage is that AU is often easier to evaluate than is $G$ [9,10].

The “wonderful life” utility (WLU) is an approximation of AU that avoids calculating expectation values by adopting a $\delta$-function $f$:

$$\mathrm{WLU}_\eta \equiv G(z) - G(z_{-\eta}, C_\eta), \tag{4}$$

where $C_\eta$ is the clamping parameter. The choice of $C_\eta$ can be set to maximize learnability. (How best to do this is beyond the scope of this paper; see below.) While not matching that of AU, for most $C_\eta$ values WLU’s learnability is better than a team game’s.
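The two difference utilities can be sketched in a few lines of Python. This is a hedged illustration, not the authors' implementation: the paper leaves the distribution $f$ unspecified, so the AU below simply assumes a uniform $f$ over a finite move set, and the helper names and the toy $G$ are hypothetical.

```python
def replace_move(z, agent, move):
    """Joint move z with the given agent's coordinate set to `move`."""
    return z[:agent] + (move,) + z[agent + 1:]


def aristocrat_utility(G, z, agent, moves):
    """AU = G(z) - sum_{z'} f(z') G(z', z_-agent), with f ASSUMED uniform."""
    d = sum(G(replace_move(z, agent, m)) for m in moves) / len(moves)
    return G(z) - d


def wonderful_life_utility(G, z, agent, clamp):
    """WLU = G(z) - G(z_-agent, clamp), as in Eq. (4)."""
    return G(z) - G(replace_move(z, agent, clamp))


# Toy world utility (hypothetical).
G = lambda z: z[0] * z[1] + z[2]
z = (2, 1, 5)
print(aristocrat_utility(G, z, 0, (0, 1, 2)))
print(wonderful_life_utility(G, z, 0, 0))
```

Because the subtracted term does not depend on the agent's own move, any change the agent makes shifts its difference utility by exactly the change in $G$, which is why such utilities are factored.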

Finally, one way to address term 1 of the central equation is to incorporate exploration/exploitation techniques like SA. We describe how to do this below.

III. EXPERIMENTS

In our version of SA, at the beginning of each time step $t$ a proposal distribution $h_\eta(z_\eta)$ is formed for every separate coordinate $\eta$. Each such distribution assigns probability 75% to the move $\eta$ had at the end of the preceding time step, $z_{\eta,t-1}$, and uniformly divides probability 25% across the other moves. The “exploration” joint move $z_{\mathrm{expl}}$ is then formed by simultaneously sampling all the $h_\eta$. If $G(z_{\mathrm{expl}}) > G(z_{t-1})$, $z_t$ is set to $z_{\mathrm{expl}}$. Otherwise $z_t$ is set by sampling a Boltzmann distribution over the two $z$’s for “energies” $-G(z_{t-1})$ and $-G(z_{\mathrm{expl}})$, respectively. Many different annealing schedules were investigated; all results below are for the best schedules found.
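The SA step just described can be sketched as follows. This is a minimal Python illustration under stated assumptions: every coordinate shares one finite move set, the Boltzmann distribution is $\propto e^{-E/T}$ with energies $E = -G$, and the names `proposal` and `sa_step` are hypothetical.

```python
import math
import random


def proposal(prev_move, moves, stay=0.75):
    """Per-coordinate proposal h: probability `stay` on the previous move,
    the remainder divided uniformly over the other moves."""
    other = (1.0 - stay) / (len(moves) - 1)
    return {m: (stay if m == prev_move else other) for m in moves}


def sa_step(G, z_prev, moves, temperature, rng):
    """One time step: sample all coordinates at once, keep the exploration
    move if it improves G, otherwise choose between the two joint moves
    with Boltzmann probabilities (energies -G)."""
    z_expl = []
    for prev in z_prev:
        h = proposal(prev, moves)
        z_expl.append(rng.choices(list(h), weights=list(h.values()))[0])
    z_expl = tuple(z_expl)

    g_prev, g_expl = G(z_prev), G(z_expl)
    if g_expl > g_prev:
        return z_expl
    # Here g_expl <= g_prev, so the exponent is <= 0 and cannot overflow.
    e = math.exp(-(g_prev - g_expl) / temperature)
    p_expl = e / (1.0 + e)
    return z_expl if rng.random() < p_expl else z_prev


# Usage on a toy objective (hypothetical): maximize -sum((x - 2)^2).
rng = random.Random(0)
G = lambda z: -sum((x - 2) ** 2 for x in z)
z = (0, 0, 0)
for _ in range(200):
    z = sa_step(G, z, (0, 1, 2, 3), temperature=0.5, rng=rng)
```

At very low temperature the keep/reject step almost never accepts a worse joint move, so the search becomes purely greedy.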

IC is identical except that each $h_\eta$ is replaced by $h_\eta(z_\eta)\, c_{\eta,t}(z_\eta) / \sum_{z'_\eta} h_\eta(z'_\eta)\, c_{\eta,t}(z'_\eta)$. To form $c_{\eta,t}$, for each of its possible moves $z_\eta$, agent $\eta$ collects those instances in its training set for which the move was $z_\eta$. It then forms a weighted average of the associated $g_\eta$ values recorded in the training set. (The weights decay exponentially with how old that pair is, to reflect nonstationarity of the system.) This gives an estimate of what $g_\eta$ is likely to be for each move $z_\eta$. $c_{\eta,t}$ then is the Boltzmann distribution over the possible $z_\eta$, parameterized by a “learning temperature,” with the “energy” for each $z_\eta$ set to the associated estimate of what $g_\eta$ is likely to be. (COIN algorithms just use this Boltzmann distribution to give $z_t$ directly, with no proposal distribution, keep/reject step, etc.)
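A possible reading of the IC proposal is sketched below in Python. Several details are assumptions rather than facts from the paper: the training set is taken to be a time-ordered list of (move, payoff) pairs, the decay factor and the fallback estimate for unseen moves are hypothetical parameters, and the Boltzmann distribution is oriented so that moves with higher estimated payoff receive higher probability.

```python
import math


def estimate_payoffs(training_set, moves, decay=0.9, default=0.0):
    """Exponentially age-weighted average payoff per move.
    `training_set` is a list of (move, g_value) pairs, oldest first.
    `decay` and `default` are hypothetical illustration parameters."""
    estimates = {}
    for m in moves:
        num = den = 0.0
        for age, (move, g_value) in enumerate(reversed(training_set)):
            if move == m:
                weight = decay ** age
                num += weight * g_value
                den += weight
        estimates[m] = num / den if den > 0 else default
    return estimates


def boltzmann(estimates, temperature):
    """c: Boltzmann distribution over moves at the learning temperature.
    ASSUMED orientation: higher estimated payoff -> higher probability."""
    top = max(estimates.values())
    weights = {m: math.exp((e - top) / temperature)
               for m, e in estimates.items()}
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}


def ic_proposal(h, c):
    """Normalized product h(z) c(z) / sum_z' h(z') c(z')."""
    weights = {m: h[m] * c[m] for m in h}
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}


training = [(0, 1.0), (1, 3.0), (0, 2.0)]
est = estimate_payoffs(training, moves=(0, 1), decay=0.5)
c = boltzmann(est, temperature=0.2)
q = ic_proposal({0: 0.75, 1: 0.25}, c)
```

In this example the SA proposal favors staying on move 0, but the learned estimates shift most of the probability to move 1, which is the sense in which the coordinate explores "intelligently."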

In all our experiments AU used a mean-field approximation to pull the expectation inside the $G(\cdot)$ in the evaluation of $D(z_{-\eta})$. Unless otherwise specified, for simplicity, $C_\eta$ (used in WLU) was set to 0.

In the bin-packing problem, $N$ items, all of size $< c$, must be assigned into a minimal subset of $N$ bins, without assigning a summed size $> c$ to any one bin. $G$ of an assignment pattern is the negative of the number of occupied bins [14], and each agent controls the bin choice of one item. All algorithms use a modified “$G$,” $G_{\mathrm{soft}}$, even though their performance is ultimately measured with $G$:



where $x_i$ is the summed size of all items in bin $i$.
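The quantities named in this problem statement can be computed directly. The Python sketch below covers only what the text defines: the per-bin loads $x_i$, the world utility $G$ (negative number of occupied bins), and the capacity constraint. It does not attempt $G_{\mathrm{soft}}$. The function names and the example instance are hypothetical.

```python
def bin_loads(assignment, sizes, n_bins):
    """x_i: summed size of all items assigned to bin i.
    assignment[k] is the bin chosen by the agent controlling item k."""
    loads = [0] * n_bins
    for item, chosen_bin in enumerate(assignment):
        loads[chosen_bin] += sizes[item]
    return loads


def world_utility(assignment):
    """G: the negative of the number of occupied bins."""
    return -len(set(assignment))


def is_feasible(loads, capacity):
    """True when no bin holds a summed size greater than the capacity c."""
    return all(x <= capacity for x in loads)


sizes = [4, 5, 3, 6]
assignment = [0, 0, 1, 1]          # items 0,1 in bin 0; items 2,3 in bin 1
loads = bin_loads(assignment, sizes, n_bins=4)
print(loads, world_utility(assignment), is_feasible(loads, capacity=12))
```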

TABLE I. Bin-packing G at time 200 for N = 20, c = 12.

Algorithm    Average G        Best   Worst   % Optimum
IC WLU       -3.32 ± 0.22     2      8       12%
IC TG        -7.84 ± 0.17     6      10      0%
COIN WLU     -3.52 ± 0.20     2      7       64%
COIN TG      -7.84 ± 0.15     6      9       0%
SA           -6.00 ± 0.19     4      7       0%

In the IC runs the learning temperature was 0.2, and all agents made the transition to RL-based moves after a period of 100 purely random $z$’s that was used to generate the initial training sets $\{n_\eta\}$. The exploitation temperature started at 0.5 for all algorithms, and was multiplied by 0.8 every 100 exploitation time steps. In each SA run, $h$ was slowly modified to generate solutions that differed less from the current solution as time progressed.
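The exploitation-temperature schedule used in these runs (start at 0.5, multiply by 0.8 every 100 exploitation time steps) can be written as a one-line function. This sketch assumes time steps are counted from 0 and that the reduction is applied at the start of each new block of 100 steps; the function name is hypothetical.

```python
def exploitation_temperature(step, start=0.5, factor=0.8, period=100):
    """Geometric cooling: `start` multiplied by `factor` once per `period`
    completed exploitation time steps (steps counted from 0)."""
    return start * factor ** (step // period)


for step in (0, 99, 100, 250):
    print(step, exploitation_temperature(step))
```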

In Table I “Best” and “Worst” give the extremal end-of-the-run $G$ values (25 runs total), and “% Optimum” the percentage of runs within one bin of the best value. Figure 1 illustrates that two of the algorithms that account for both terms 2 and 3, IC WLU and COIN WLU, far outperform the others, with the algorithm accounting for all three terms doing best. The worst algorithms were those that accounted for only a single term (SA and COIN TG). Linearly (i.e., optimistically) extrapolating SA’s performance from time 15 000 indicates it would take over 1000 times as long as IC WLU to reach the $G$ value IC WLU reaches at time 200. In addition, the ratio of WLU’s time 1000 performance (minus that of random search) to SA’s grows linearly with the size of the problem. Finally, Fig. 2 illustrates that the benefit of addressing terms 2 and 3 grows with the difficulty of the problem. In both figures SA outperforms IC-TG due to its benefiting from more parameter tuning.

For the format choice problem $G$ is the sum over all $N_a$ agents $\eta$ of $\eta$’s “happiness” with its music formats:

$$G = \sum_{\eta} \sum_{i=1}^{N_f} \sum_{\eta' \in \mathrm{neigh}_\eta} \vartheta(i)\, \omega_{\eta,\eta',i}\, \mathrm{pref}_{\eta,i},$$

where $N_f$ is the number of formats; $\mathrm{neigh}_\eta$ is the set of players $\le D$ hops away from player $\eta$; $\mathrm{pref}_{\eta,i}$ is $\eta$’s intrinsic preference for format $i$ (randomly fixed $\in [0,1]$); $\vartheta(i)$ is the total number of players choosing format $i$ (i.e., the inverse price for format $i$); and $\omega_{\eta,\eta',i} = 1$ if the choices of players $\eta$ and $\eta'$ both include format $i$, 0 otherwise. $\eta$’s move says which of four formats not to use. For example, $\mathrm{WLU}_\eta = -\sum_{i,\,\eta' \in \mathrm{neigh}_\eta} \vartheta(i)\, \Omega_{\eta,\eta',i}\, \mathrm{pref}_{\eta,i}$, where $\Omega_{\eta,\eta',i} = 1$ if agent $\eta'$’s choice set includes $i$ but agent $\eta$’s does not, and equals 0 otherwise. Both $D = 1$ and $3$ were investigated.

FIG. 1. Average bin-packing $G$ vs time for $N = 50$, $c = 10$. All error bars $\le 0.31$ except IC-AU and COIN-AU, which are $\le 0.57$.

FIG. 2. $G$ vs $c$ for $N = 20$ at $t = 200$. All error bars $\le 0.34$.

FIG. 3. $G(t = 200)$ for 100 agents. In order from left to right, $D = \{1, 1, 3, 3\}$, and topologies are $\{L, W, L, W\}$.

In Fig. 3, “IC Econ” refers to WLU IC, with clamping making the agent decline all formats. It is a crude way of endogenizing externalities, and then rescaling and interleaving with SA to improve performance. “IC-WLU” instead clamps $\eta$’s move to zero (as per the theory of collectives), so that $\eta$ chooses all formats. Learning temperature was 0.4, and exploitation temperature was 0.05 (annealing provided no advantage since runs were short). We used $m$-node ring topologies with an extra $0.06m$ random links added, a new such set for each of the 50 runs giving a plotted average value. “Short links” (L) means all extra links connected players two hops apart, and “small-worlds” (W) means there was no such restriction.
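The two topologies can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' generator: extra links are drawn uniformly at random without duplicates, the number of extra links is rounded to the nearest integer, and $\mathrm{neigh}_\eta$ is taken to exclude $\eta$ itself. All function names are hypothetical.

```python
import random
from collections import deque


def ring_with_extra_links(m, short, rng, fraction=0.06):
    """m-node ring plus round(fraction * m) extra links.
    short=True: each extra link joins players two hops apart on the ring (L).
    short=False: extra links join arbitrary pairs (small-world, W)."""
    adj = {i: {(i - 1) % m, (i + 1) % m} for i in range(m)}
    added = 0
    while added < round(fraction * m):
        a = rng.randrange(m)
        b = (a + 2) % m if short else rng.randrange(m)
        if b != a and b not in adj[a]:
            adj[a].add(b)
            adj[b].add(a)
            added += 1
    return adj


def hop_distances(adj, source):
    """Breadth-first search: hop count from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def neighbors_within(adj, node, d):
    """Players at most d hops away from `node` (excluding the node itself)."""
    return {v for v, h in hop_distances(adj, node).items() if 0 < h <= d}


def average_hop_distance(adj):
    """Mean hop distance over all ordered pairs of distinct, connected nodes."""
    total = pairs = 0
    for s in adj:
        for v, h in hop_distances(adj, s).items():
            if v != s:
                total += h
                pairs += 1
    return total / pairs


rng = random.Random(0)
short = ring_with_extra_links(100, short=True, rng=rng)
small_world = ring_with_extra_links(100, short=False, rng=rng)
print(average_hop_distance(short), average_hop_distance(small_world))
```

Comparing the two printed averages shows how unrestricted links shorten typical internode distances relative to short links on the same ring.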

IC Econ’s inferior performance illustrates the shortcoming of economics-like algorithms. For $D = 1$ SA did not benefit from small-world connections, and IC variants barely benefited (3%), despite the associated drop in average internode hop distance. However, if $D$ also increased, so that $G$ directly reflected the change in the topology, then the gain with a small-world topology grew to 10%.